Using the IPython Notebook for Reproducible Parallel Computing


In [1]:
from IPython.display import display, Image, HTML
from talktools import website, nbviewer

SIAM Conference on Parallel Processing for Scientific Computing (PP 2014)

Brian E. Granger (@ellisonbg)

Physics Professor, Cal Poly

Core developer, IPython Project

The IPython Project

IPython is an open source, interactive computing environment for Python and other languages.


In [2]:
website('http://ipython.org')


Out[2]:
  • Started in 2001 by Fernando Perez, who continues to lead the project from UC Berkeley
  • Open source, BSD license
  • Started as an enhanced interactive Python shell:

  • Today, IPython is a powerful architecture for interactive code execution:
    • Language independent message specification for running code in and getting results from remote processes (JSON over WebSockets and ZeroMQ).
    • IPython Frontends: web-based notebook, Qt Console, terminal console
    • IPython.parallel: interactive parallel computing

See the following talk on IPython.parallel by Min Ragan-Kelley
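As a tiny taste of IPython.parallel (a sketch only, assuming a set of engines has already been started with ipcluster):

from IPython.parallel import Client

rc = Client()                               # connect to the running engines
view = rc[:]                                # a direct view on all of them
view.map_sync(lambda x: x**2, range(8))     # run the function across the engines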

Funding

  • Over the past 13 years, much of IPython has been "funded" by volunteer developer time.
  • Past funding: NASA, DOD, NIH, Enthought Corporation
  • Current funding:

Development team

  • IPython is developed by a talented team of $\approx15$ core developers and a larger community of $\approx100$ contributors.
  • Through the above funding sources, there are currently 6 full time people working on IPython at UC Berkeley and Cal Poly.

In [3]:
import ipythonproject

In [4]:
ipythonproject.core_devs()


Fernando Perez

Brian Granger

Min Ragan-Kelley

Thomas Kluyver

Matthias Bussonnier

Jonathan Frederic

Paul Ivanov

Evan Patterson

Damian Avila

Brad Froehle

Zach Sailer

Robert Kern

Jorgen Stenarson

Jonathan March

Kyle Kelley

The IPython Notebook

The IPython Notebook is a web-based interactive computing environment that spans the full range of computing-related activities:

  1. Individual exploration, analysis and visualization
  2. Debugging, testing
  3. Production runs
  4. Parallel computing
  5. Collaboration
  6. Publication
  7. Presentation
  8. Teaching/Learning

How does IPython target these different activities?

Interactive exploration

The central focus of IPython is the writing and running of code. We try to make this as pleasant as possible:

  • Multiline editing
  • Tab completion
  • Integrated help
  • Syntax highlighting
  • System shell access

Let's use NumPy and Matplotlib to look at the eigenvalue spacing distribution of random matrices:


In [5]:
%matplotlib inline

In [6]:
import matplotlib.pyplot as plt
import seaborn
import numpy as np
ra = np.random
la = np.linalg

In [7]:
def GOE(N):
    """Creates an NxN element of the Gaussian Orthogonal Ensemble"""
    m = ra.standard_normal((N,N))
    m += m.T
    return m/2

def center_eigenvalue_diff(mat):
    """Compute the eigvals of mat and then find the center eigval difference."""
    N = len(mat)
    evals = np.sort(la.eigvals(mat))
    diff = np.abs(evals[N//2] - evals[N//2-1])  # integer index, so this also works under Python 3
    return diff

def ensemble_diffs(num, N):
    """Return num eigenvalue diffs for the NxN GOE ensemble."""
    diffs = np.empty(num)
    for i in range(num):
        mat = GOE(N)
        diffs[i] = center_eigenvalue_diff(mat)
    return diffs/diffs.mean()

In [8]:
diffs = ensemble_diffs(1000,30)

In [9]:
plt.hist(diffs, bins=30, normed=True)
plt.xlabel('Normalized eigenvalue spacing s')
plt.ylabel('Probability $P(s)$')


Out[9]:
<matplotlib.text.Text at 0x10b6c2ad0>

Common shell commands (ls, cd) just work:


In [10]:
ls


LICENSE             data/               ipythonproject.pyc  load_style.pyc      talktools.py
README.md           images/             ipythonteam/        lorenz.py           talktools.pyc
SIAM Talk.ipynb     ipythonproject.py   load_style.py       talk.css

Manage small files in the notebook using the %%writefile magic command:


In [11]:
%%writefile data/mydata.csv
0 1 2 3 4 5 6 7 8 9 10


Overwriting data/mydata.csv

Any command prefixed with ! is run in the system shell:


In [12]:
!cat data/mydata.csv


0 1 2 3 4 5 6 7 8 9 10

What does this have to do with parallel computing?

The canonical user interface to clusters and supercomputers is a terminal session over SSH. Ouch. This is extremely painful for the user and makes it almost impossible to reproduce the workflow. Here is a simple recipe for making parallel computing reproducible and literate (a hypothetical sketch follows the list):

  1. Install and run the IPython Notebook on the head node
  2. Write notebooks that create input files, submit jobs, perform post processing, visualization
  3. Provide inline narrative descriptions of the workflow
  4. Store the notebooks in a version control system (git, svn, etc.)
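For instance, steps 1 and 2 fit naturally into ordinary notebook cells. Everything below is a hypothetical sketch: the job script name, the resource request, and compute_diffs.py are placeholders, and the qsub/qstat commands assume a PBS-style scheduler:

%%writefile myjob.pbs
#PBS -l nodes=4:ppn=8
#PBS -N goe_spacings
cd $PBS_O_WORKDIR
mpiexec python compute_diffs.py

Then, in a separate cell, submit and monitor the job without leaving the notebook:

!qsub myjob.pbs
!qstat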

Multiple backend languages

Scientific computing is a multi-language activity: Python, C, C++, Fortran, Perl, Bash, and more. The IPython architecture is language-agnostic.

For statistical computing, R is a great option. Let's fit a linear model in R and visualize the results:


In [13]:
import numpy as np
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
%load_ext rmagic

The %%R syntax tells IPython to run the rest of the cell as R code:


In [14]:
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)


Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:   0.81,	Adjusted R-squared:  0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739

This %%language syntax is an IPython-specific extension to the Python language. This "magic command syntax" allows Python code to call out to a wide range of other languages (Ruby, Bash, Julia, Fortran, Perl, Octave, Matlab, etc.).


In [15]:
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"


Hello from Ruby 2.0.0

In [16]:
%%bash
echo "hello from $BASH"


hello from /bin/bash

Native kernels

In the IPython architecture, the kernel is a separate process that runs the user's code and returns the output to the frontend (Notebook, Terminal, etc.). Kernels talk to frontends using a well-documented message protocol (JSON over ZeroMQ and WebSockets). The default kernel that ships with IPython runs Python code, but kernels for other languages now exist as well.
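For concreteness, here is roughly what an execute_request message looks like to a kernel. This is a simplified sketch of the message format; the field values are illustrative, not a real capture:

# A simplified, illustrative execute_request message.
execute_request = {
    "header": {
        "msg_id": "a1b2c3d4",           # unique id for this message
        "session": "f00dcafe",          # id of the client session
        "username": "bgranger",
        "msg_type": "execute_request",  # tells the kernel what to do
    },
    "parent_header": {},                # filled in on replies to link request and response
    "metadata": {},
    "content": {
        "code": "2 + 2",                # the code the kernel should run
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": True,
    },
}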

Later this year, all users of the IPython Notebook will have the option to choose which type of kernel to use for each notebook.

Here is a notebook that runs code in the native Julia kernel:


In [17]:
website("http://nbviewer.ipython.org/url/jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")


Out[17]:

Notebook documents

Notebook documents are just JSON files stored on your filesystem (a quick inspection sketch follows the list below). These files store everything related to a computation:

  • Code
  • Output (text, HTML, plots, images, JavaScript)
  • Narrative text (Markdown with embedded LaTeX math)
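Because a notebook is just JSON, it can be inspected like any other data file. A quick sketch; the exact top-level keys depend on the nbformat version:

import json

# Load this very talk as plain JSON and look at its top-level structure.
with open("SIAM Talk.ipynb") as f:
    nb = json.load(f)

print(nb.keys())   # e.g. metadata, nbformat, nbformat_minor, and the cells/worksheets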

Notebook documents can be shared:

  • GitHub repos
  • Email
  • Dropbox
  • Internal shared file systems

Notebook documents can be viewed by anyone on the web through http://nbviewer.ipython.org


In [18]:
website("http://nbviewer.ipython.org")


Out[18]:

This allows people to compose and share reproducible stories that involve code and data.

Earlier this year, Randall Munroe (xkcd) published a comic about regular expression golf. Peter Norvig from Google wanted to explore some of the algorithms related to this comic and shared his explorations as a notebook on nbviewer:


In [20]:
website("http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb")


Out[20]:

Rich output

IPython has a display system for rich output formats. This rich display system allows Python objects to declare non-textual representations that can be displayed in the Notebook (a sketch of how an object opts in follows the list below). These rich representations include:

  • PNG/JPEG
  • HTML
  • JavaScript
  • LaTeX
  • SVG
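Under the hood, any Python object can opt into this system by defining methods such as _repr_html_ or _repr_png_. A minimal sketch (the Color class is a made-up example):

class Color(object):
    """A toy object that displays itself as an HTML color swatch."""
    def __init__(self, css_color):
        self.css_color = css_color

    def _repr_html_(self):
        # IPython calls this method to get an HTML representation for the notebook.
        return ('<div style="width:100px; height:30px; background-color:%s"></div>'
                % self.css_color)

Color("steelblue")   # the last expression in a cell is shown using its rich repr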

These rich representations are displayed using IPython's display function:


In [21]:
from IPython.display import HTML, Image, YouTubeVideo, Audio, Latex

Here is an Image object whose representation is an image:


In [22]:
i = Image('images/ipython_logo.png')

In [23]:
display(i)


The Audio object has a representation that is an HTML5 audio player:


In [24]:
a = Audio('data/Bach Cello Suite #3.wav')

In [25]:
display(a)


The Latex object produces a representation that is rendered LaTeX. In this case, Maxwell's equations:


In [26]:
Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0 
\end{eqnarray}""")


Out[26]:
\begin{eqnarray} \nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\ \nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\ \nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\ \nabla \cdot \vec{\mathbf{B}} & = 0 \end{eqnarray}

The YouTubeVideo object embeds the HTML for a YouTube video in the notebook:


In [27]:
YouTubeVideo('sjfsUzECqK0')


Out[27]:

Interacting with data

Data exploration is an iterative process that involves repeated passes at visualization, interaction and computation:


In [28]:
Image('images/VizInteractCompute.png')


Out[28]:

Right now this cycle is still really painful:

  • It takes too long to go through a single iteration
  • Even when we are successful, the overall process is not reproducible
  • Difficult to repeat, generalize or share with others
  • Massive cognitive load that has nothing to do with extracting insight from the data

For IPython 2.0 we have built an architecture that allows Python and JavaScript to communicate seamlessly and in real time. This allows users to easily iterate through this cycle.

Image editing

In this example, we will perform some basic image processing using scikit-image.


In [29]:
from IPython.html.widgets import *

In [30]:
import skimage
from skimage import data, filter, io

In [33]:
i = data.coffee()
io.Image(i)


Out[33]:

In [34]:
def edit_image(image, sigma=0.1, r=1.0, g=1.0, b=1.0):
    """Blur the image with a Gaussian of width sigma, then scale each RGB channel."""
    new_image = filter.gaussian_filter(image, sigma=sigma, multichannel=True)
    new_image[:,:,0] = r*new_image[:,:,0]  # red channel
    new_image[:,:,1] = g*new_image[:,:,1]  # green channel
    new_image[:,:,2] = b*new_image[:,:,2]  # blue channel
    new_image = io.Image(new_image)
    display(new_image)
    return new_image

Calling the function once displays and returns the edited image:


In [35]:
new_i = edit_image(i, 0.5, r=0.5);



In [36]:
lims = (0.0,1.0,0.01)
interact(edit_image, image=fixed(i), sigma=(0.0,10.0,0.1), r=lims, g=lims, b=lims);


Lorenz system

Let's explore the Lorenz system of differential equations:

$$ \begin{aligned} \dot{x} & = \sigma(y-x) \\ \dot{y} & = \rho x - y - xz \\ \dot{z} & = -\beta z + xy \end{aligned} $$

This is one of the classic systems in non-linear differential equations. It exhibits a range of different behaviors as the parameters ($\sigma$, $\beta$, $\rho$) are varied.
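The integration itself is only a few lines of SciPy. Here is a minimal sketch of the right-hand side and a call to odeint; the talk's solve_lorenz (from the accompanying lorenz.py) wraps something like this and adds the 3D plotting:

import numpy as np
from scipy.integrate import odeint

def lorenz_deriv(xyz, t, sigma=10.0, beta=8./3, rho=28.0):
    """Right-hand side of the Lorenz system at the point xyz = (x, y, z)."""
    x, y, z = xyz
    return [sigma * (y - x), rho * x - y - x * z, x * y - beta * z]

t = np.linspace(0, 4, 1000)                      # time grid
x_t = odeint(lorenz_deriv, [1.0, 1.0, 1.0], t)   # trajectory from one initial condition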


In [37]:
from IPython.html.widgets import interact, fixed
from IPython.display import clear_output, display, HTML

Here is a Python function that solves the Lorenz system using SciPy and plots the results using matplotlib:


In [38]:
from lorenz import solve_lorenz

In [39]:
t, x_t = solve_lorenz(N=10, angle=0.0, max_time=4.0, sigma=10.0, beta=8./3, rho=28.0)


Let's use interact to explore this function:


In [40]:
interact(solve_lorenz, angle=(0.,360.), N=(0,50), sigma=(0.0,50.0),
         rho=(0.0,50.0), beta=fixed(8./3));


Conclusion

The IPython Notebook enables users to tell reproducible stories that involve code and data.

The scripts and command-line programs used in the traditional parallel computing workflow can all be managed and run from within the Notebook.

The Python ecosystem provides a rich foundation for data analysis, visualization, algorithm development, and parallel computing.


In [41]:
%load_ext load_style

In [ ]:
%load_style talk.css
